Performance Analysis of All-to-All Communication on the Blue Gene/L Supercomputer

Authors

  • Sameer Kumar
  • Philip Heidelberger
Abstract

All-to-all communication is a well-known performance bottleneck for many applications. For such applications to scale to a large number of processors, optimizing all-to-all communication is critical. In this paper, we analyze the performance of all-to-all communication on the Blue Gene/L torus interconnection network, which has limited bisection bandwidth. The torus interconnect topology has link contention even for all-to-all communication operations with short messages. We observed that the performance of all-to-all communication also depends on the shape of the processor partition. We present a performance analysis of all-to-all communication on mesh and torus partitions of various shapes and sizes. We then present optimization schemes to enhance the performance of all-to-all communication. The large message optimization substantially improves all-to-all performance on an asymmetric torus. In particular, performance improved from about 70% to over 99% of peak on a 20,480-node (40 × 32 × 16) configuration, which was the largest machine to which we had access. The short message optimization can double all-to-all performance for very short messages.
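The abstract's remarks about limited bisection bandwidth and partition shape can be illustrated with a textbook bisection argument. The sketch below is not the paper's performance model; it is a minimal estimate under stated assumptions: every node sends the same number of bytes to every other node, the longest dimension has an even size, all links have the same one-way bandwidth, and the function name and parameters (`alltoall_bisection_bound`, `dims`, `msg_bytes`, `link_bw`) are hypothetical.

```python
def alltoall_bisection_bound(dims, msg_bytes, link_bw, torus=True):
    """Lower bound on all-to-all time implied by the bisection cut.

    dims      -- partition shape, e.g. (40, 32, 16)
    msg_bytes -- bytes each node sends to each other node (assumed uniform)
    link_bw   -- one-way bandwidth of a single link, in bytes per time unit
    torus     -- True for wraparound links, False for a mesh
    """
    nodes = 1
    for d in dims:
        nodes *= d

    # Halving the partition across its longest dimension severs the
    # fewest links, so that cut gives the tightest bound.
    longest = max(dims)
    links_per_plane = nodes // longest

    # A torus is cut in two places (middle plus wraparound), a mesh in one.
    planes = 2 if torus else 1
    cut_links = planes * links_per_plane  # links in one direction

    # Every node in one half sends msg_bytes to every node in the other half.
    crossing_bytes = (nodes // 2) * (nodes // 2) * msg_bytes

    return crossing_bytes / (cut_links * link_bw)


if __name__ == "__main__":
    # Normalized units: 1 byte per pair, 1 byte per time unit per link.
    for shape in [(40, 32, 16), (32, 32, 20)]:
        print(shape, alltoall_bisection_bound(shape, 1, 1.0))
```

Under these assumptions the bound grows with the longest dimension, so two partitions with the same node count but different shapes get different bounds, and a mesh gets twice the bound of a torus of the same shape.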

Similar resources

Implementing Optimized Collective Communication Routines on the IBM BlueGene/L Supercomputer

BlueGene/L is a massively parallel supercomputer that is currently the fastest in the world. Implementing MPI, and especially fast collective communication operations, can be challenging on such an architecture. In this paper, I will present optimized implementations of MPI collective algorithms on the BlueGene/L supercomputer and show performance results compared to the default MPICH2 algorithm...

Performance Measurements of the 3D FFT on the Blue Gene/L Supercomputer

This paper presents performance characteristics of a communications-intensive kernel, the complex data 3D FFT, running on the Blue Gene/L architecture. Two implementations of the volumetric FFT algorithm were characterized, one built on the MPI library using an optimized collective all-to-all operation [2] and another built on a low-level System Programming Interface (SPI) of the Blue Gene/L Adv...

Versatile Communication Algorithms for Data Analysis

Large-scale parallel data analysis, where global information from a variety of problem domains is resolved in a distributed memory space, relies on communication. Three communication algorithms motivated by data analysis workloads—merge based reduction, swap based reduction, and neighborhood exchange—are presented, and their performance is benchmarked. These algorithms communicate custom data t...

Model and simulation of exascale communication networks

Exascale supercomputers will have millions or even hundreds of millions of processing cores and the potential for nearly billion-way parallelism. Exascale compute and data storage architectures will be critically dependent on the interconnection network. The most popular interconnection network for current and future supercomputer systems is the torus (e.g., k-ary, n-cube). This paper focuses o...

Toward the Graphics Turing Scale on a Blue Gene Supercomputer

We investigate raytracing performance that can be achieved on a class of Blue Gene supercomputers. We measure an 822-times speedup over a Pentium IV on a 6144-processor Blue Gene/L. We measure the computational performance as a function of number of processors and problem size to determine the scaling performance of the raytracing calculation on the Blue Gene. We find nontrivial scaling behavior...

Publication date: 2007